The Contingency Table and Metrics
|  | True + | True - |  |  |
|---|---|---|---|---|
| Pred + | TP | FP (Type I Error) | → | PPV / Precision |
| Pred - | FN (Type II Error) | TN | → | NPV |
|  | ↓ | ↓ | ↘ |  |
|  | Sensitivity / TPR / Recall | Specificity / TNR |  | Accuracy |
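A minimal sketch of how the table's cells and metrics fall out of predictions (pure Python, hypothetical toy labels):

```python
# Hypothetical binary labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Fill the contingency table by counting each (pred, truth) combination.
tp = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_true))
fp = sum(p == 1 and t == 0 for p, t in zip(y_pred, y_true))  # Type I error
fn = sum(p == 0 and t == 1 for p, t in zip(y_pred, y_true))  # Type II error
tn = sum(p == 0 and t == 0 for p, t in zip(y_pred, y_true))

accuracy    = (tp + tn) / len(y_true)  # overall agreement with truth
sensitivity = tp / (tp + fn)           # TPR / Recall: down the "True +" column
specificity = tn / (tn + fp)           # TNR: down the "True -" column
precision   = tp / (tp + fp)           # PPV: across the "Pred +" row
npv         = tn / (tn + fn)           # across the "Pred -" row
```

Column-wise metrics condition on the truth; row-wise metrics condition on the prediction. That split is exactly how the two sections below are organized.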
Metrics - Pred given Truth
We’re computing P(Prediction | Truth) with these.
Accuracy
Overall, how often did your predictions match the true labels?
Many caveats, the most important being imbalanced classes. If your labels are 90% “Doesn’t have Rare Disease” you’ll have a trained model that’s really good at saying you don’t have the disease. You’ll have super high accuracy. But is that a good metric?
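The rare-disease trap in two lines (hypothetical 90/10 split):

```python
# Hypothetical rare-disease dataset: 90% negative, 10% positive.
y_true = [0] * 90 + [1] * 10

# A "model" that always says "Doesn't have Rare Disease".
y_pred = [0] * 100

accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
print(accuracy)  # 0.9 -- looks great, yet it never catches a single case
```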
Sensitivity / TPR / Recall
Of the cases that are actually positive, how many did you catch?
Specificity / TNR
Of the cases that are actually negative, how many did you correctly rule out?
F1 Score
This is the Harmonic Mean of Precision and Recall: F1 = 2 · (Precision · Recall) / (Precision + Recall).
So the F1 Score suffers when you rack up a lot of FPs and FNs: even if the arithmetic mean of Precision and Recall looks high, the harmonic mean tanks whenever either one is low. This is a good thing!
Now you can have a Macro F1, which computes the F1 for each class and averages them, treating all classes equally. Or you can do a Weighted F1, which accounts for class imbalance by weighting each class’s score by its number of instances. The latter may be a better summary than Accuracy in some cases.
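A sketch of the macro vs. weighted difference, with hypothetical three-class labels where class 0 dominates:

```python
from collections import Counter

def f1_for_class(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class (harmonic mean of P and R)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced labels; the model misses class 1 entirely.
y_true = [0] * 8 + [1] * 1 + [2] * 1
y_pred = [0] * 8 + [0] * 1 + [2] * 1

classes = sorted(set(y_true))
support = Counter(y_true)
f1s = {c: f1_for_class(y_true, y_pred, c) for c in classes}

macro_f1 = sum(f1s.values()) / len(classes)                            # all classes equal
weighted_f1 = sum(f1s[c] * support[c] for c in classes) / len(y_true)  # weight by support
```

Here macro F1 (≈ 0.65) punishes the total miss on the tiny class, while weighted F1 (≈ 0.85) is propped up by the dominant class, the same way Accuracy would be.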
AUROC and ROC
ROC plots the TPR (Sensitivity) against the FPR (1 - Specificity) at each “operating point”, i.e. threshold. The area under that curve is the AUROC: 0.5 is random ranking, 1.0 is perfect ranking. It hides information on class-imbalance effects, calibration, and prevalence effects.
What does this tell you? It’s all about ranking. What is the probability that if I picked a (Sick Patient, Healthy Patient) tuple, my model will rank the sick patient higher?
Calibration
Small example. Imagine you have 10 patients. 4 actually have the condition (call them sick), 6 don’t (healthy). Your model gives each a risk score between 0 and 1. Here are the scores, sorted highest to lowest, with their true status:
| Patient | Risk Score | Truly sick? |
|---|---|---|
| A | 0.95 | ✅ sick |
| B | 0.88 | ✅ sick |
| C | 0.72 | ✅ sick |
| D | 0.65 | ❌ healthy |
| E | 0.51 | ✅ sick |
| F | 0.40 | ❌ healthy |
| G | 0.33 | ❌ healthy |
| H | 0.21 | ❌ healthy |
| I | 0.15 | ❌ healthy |
| J | 0.08 | ❌ healthy |
Now you look at all (sick, healthy) tuples: 4 sick × 6 healthy = 24 of them. Looking at the table above, 23 of those pairs are ranked correctly (the only miss is healthy D scoring above sick E). Your AUROC is a solid 23/24 ≈ 0.96. Hooray! Not so fast. If you divide each Risk Score by 10, you will still get the same AUROC: the ranking hasn’t changed, but the scores no longer resemble actual probabilities of being sick. AUROC says nothing about calibration!
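The pair-counting view of AUROC, using the 10-patient table above, and the rescaling trick that leaves it unchanged:

```python
from itertools import product

# Risk scores from the table above.
scores = {"A": 0.95, "B": 0.88, "C": 0.72, "D": 0.65, "E": 0.51,
          "F": 0.40, "G": 0.33, "H": 0.21, "I": 0.15, "J": 0.08}
sick = ["A", "B", "C", "E"]
healthy = ["D", "F", "G", "H", "I", "J"]

def auroc(risk):
    """AUROC = P(a sick patient is ranked above a healthy one), over all pairs."""
    pairs = list(product(sick, healthy))      # 4 * 6 = 24 tuples
    wins = sum(risk[s] > risk[h] for s, h in pairs)
    return wins / len(pairs)

print(auroc(scores))                          # 23/24, approx 0.958

# Dividing every score by 10 wrecks calibration but not ranking:
rescaled = {k: v / 10 for k, v in scores.items()}
print(auroc(rescaled))                        # same 23/24
```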
Using this to Pick Thresholds
You can pick a threshold using the curve, but that’s not a nice thing to do in healthcare in general. You have to account for the cost of misses, the cost of false alarms, available resources, and so on. If only it were that simple.
Metrics - Truth given Pred
We’re flipping it around and computing P(Truth | Prediction) with these.
Positive Predictive Value / Precision
Depends on prevalence (the proportion of positive cases in the population).
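Why the prevalence dependence bites: run the same test (hypothetical 90% sensitivity, 90% specificity) on two populations via Bayes’ rule and watch PPV collapse.

```python
def ppv(sensitivity, specificity, prevalence):
    """Bayes' rule: P(truly positive | predicted positive)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

print(ppv(0.9, 0.9, 0.50))   # ~0.90 when half the population is sick
print(ppv(0.9, 0.9, 0.01))   # ~0.08 for a rare disease: most alarms are false
```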
Negative Predictive Value
Doesn’t come up much really.
Other Metrics
Risk Ratio
Risk Ratio is the probability of the outcome in the exposed group divided by the probability of the outcome in the unexposed group.
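A tiny sketch with a hypothetical 2x2 exposure table:

```python
# Hypothetical counts: outcome among 100 exposed vs. 100 unexposed people.
exposed_with_outcome, exposed_total = 30, 100
unexposed_with_outcome, unexposed_total = 10, 100

risk_exposed = exposed_with_outcome / exposed_total        # 0.30
risk_unexposed = unexposed_with_outcome / unexposed_total  # 0.10
risk_ratio = risk_exposed / risk_unexposed                 # ~3.0: triple the risk
```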
Odds Ratio
The odds of a probability p are p / (1 - p). Odds ratios are centered around 1. If it’s 1.5, the odds are 50% higher; if it’s 0.7, they’re 30% lower (“protective” in some cases).
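Reusing the same hypothetical 30/100 vs. 10/100 risks as a risk-ratio example, the odds ratio comes out a bit further from 1:

```python
def odds(p):
    """Odds of a probability p: p / (1 - p)."""
    return p / (1 - p)

# Hypothetical risks: 30% in the exposed group, 10% in the unexposed group.
odds_ratio = odds(0.30) / odds(0.10)
print(odds_ratio)   # (0.3/0.7) / (0.1/0.9), approx 3.86, vs. a risk ratio of 3.0
```

For rare outcomes (small p), odds ≈ p, so the odds ratio approximates the risk ratio; for common outcomes it exaggerates it, as here.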